A  Introduction

Tools that analyze software are exorbitant to develop, yet different analysis applications have
quite similar infrastructure requirements. An effective cost-reduction approach is to amortize the
development costs of a common infrastructure across multiple subject programming languages,
computer platforms, and analysis applications. This is the final report of a one-year project
aimed  at  creating  such  a  common  program-analysis  infrastructure. 

B  Approach 

Our  approach  to  creating  an  effective  common  infrastructure  was  to  start  with  CodeSurfer  [2,  3], 
a  tool  that  was  originally  narrowly  conceived  as  just  a  program  understanding  system  for  ANSI 
C,  and  then  adapt  it  to  meet  the  needs  of  a  collection  of  representative  researchers  wanting  to  use 
it  for  other  applications.  Specifically,  we  worked  with  six  different  efforts  within  two  CIP/SW 
MURIs  at  the  University  of  Wisconsin  and  Carnegie-Mellon  University  (CMU)  managed  by  the 
Office  of  Naval  Research  (ONR),  and  four  separately  funded  GrammaTech  projects  with  quite 
similar  goals: 


#  Application      A. Wisconsin / CMU Projects                B. GrammaTech Projects

1  Object Code      Analysis of COTS executables for the       Analysis of firmware for the Intel x86
   Analysis         Intel x86 family of processors (Reps,      family of processors (AFRL/Rome SBIR
                    Balakrishnan)                              Phase-I and Phase-II projects “Detecting
                                                               Malicious Code in Firmware”)

2  Buffer-Overrun   Detection of buffer-overrun                Detection of buffer-overrun
   Analysis         vulnerabilities in C source code (Jha,     vulnerabilities in C source code
                    Ganapathy)                                 (AFRL/Rome SBIR Phase-I and Phase-II
                                                               projects “Source Code Vulnerability
                                                               Analysis”)

3  Object Code      Binary code (x86) rewriting technology     Java byte code (Jimple) rewriting
   Rewriting        for security (Jha, Miller, Giffin, and     technology for security (NIST SBIR
                    Christodorescu)                            Phase-I and Phase-II projects “Inline
                                                               Reference Monitors for Object Code”)

4  Model Checking   (a) Weighted pushdown systems (Reps,       Verification of path properties in C and
                    Jha)                                       C++ programs (DARPA SBIR Phase-I
                    (b) Verification of properties of          and Phase-II projects “Verification of
                    concurrent C programs (Clarke, Sagar)      Hierarchical Graph Structures”)

5  Virus Detection  Detection of viruses in binary (x86)       (none)
                    code (Jha, Christodorescu)

C  Technical  Objectives 

Our  primary  goal  was  to  create  a  powerful,  flexible,  and  open  toolkit  for  static  program  analysis 
that  would  support  multiple  programming  languages,  multiple  computer  platforms,  multiple 
analysis  algorithms,  and  multiple  client  applications.  A  successful  design  would  maximize  code 
reuse,  i.e.,  minimize  the  amount  of  code  duplication  that  would  be  required  of  any  given  user  of 
the  toolkit. 

Our  secondary  goals  were  (a)  to  provide  GrammaTech  technology  to  Wisconsin  and  CMU  in
support  of  their  MURI  projects,  (b)  to  transition  research  results  from  those  projects  back  into 
GrammaTech,  and  (c)  to  support  early  adopters  of  the  transitioned  research  results  in  the 
Government. 

The  project  was  designed  to  be  a  win  for  each  of  the  three  parties  involved: 

•  Universities  would  get  access  to  high-quality,  supported  technology,  thereby  freeing 
them  to  focus  on  basic  research.  This  would  minimize  their  humdrum  engineering 
activities,  and  minimize  disruptions  involved  in  day-to-day  support  of  early  adopters. 

•  GrammaTech  would  get  access  to  world-class  researchers  and  their  prototypes  for  new 
cutting-edge  products,  as  well  as  feedback  from  early  adopters  to  guide  the  development 
of  those  products. 

•  The  Government  would  avoid  funding  wasteful  duplicate  work,  and  accelerate 
transitions  of  basic  research. 

D  Work  Items  Planned 

The  work  items  for  the  project  were  (1)  to  design  new  APIs  with  which  users  could  program 
their  static  analyses;  (2)  to  design  a  new  architecture  for  CodeSurfer;  (3)  to  implement  the  new 
architecture  and  APIs;  (4)  to  create  two  reference  implementations  of  static  analysis  algorithms 
using  the  new  architecture  and  APIs;  and  (5)  to  evaluate  the  reference  implementations  and  the 
experience  of  the  Wisconsin  project  to  assess  the  success  of  the  project. 

More  specifically,  we  anticipated  doing  the  following  work  to  meet  our  primary  objective: 

1.  Multi-lingual  capabilities.  This  work  would  involve

a.  Abstracting  language-specific  functionality  out  of  CodeSurfer  per  se. 

b.  Reinstantiating  pre-existing  versions  of  CodeSurfer  using  the  factored  language- 
independent  services  we  had  introduced. 

c.  Implementing  new  front  ends  for  one  or  more  other  programming  languages  to 
demonstrate  the  generality  of  the  new  architecture. 

2.  Builder  modularization.  This  work  would  involve 

a.  Decomposing  CodeSurfer’s  monolithic  builder  into  its  constituent  analysis 
components,  and  creating  an  open  API  for  each  such  component.  This  would 
allow  clients  of  the  system  to  program  their  own  analyses  by  writing  their  own 
code  using  the  API  in  whatever  phase  ordering  or  iteration  schemes  were  most 
suitable  for  their  own  applications. 

b.  Implementing  new  analyses  making  use  of  the  newly  exposed  components  to 
demonstrate  the  generality  of  the  new  architecture. 

3.  Back-end  extensions.  This  work  would  involve 

a.  Creating  additional  open  APIs  in  the  back-end  to  support  plug-in  applications. 

b.  Demonstrating  the  use  of  those  APIs  by  various  plug-in  applications. 
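
The planned builder modularization (item 2 above) would have let clients run analysis components in whatever phase ordering or iteration scheme suited them. The following is a minimal sketch of that idea only; the phase names, the `Facts` dictionary, and `run_until_stable` are hypothetical and are not part of any CodeSurfer API.

```python
from typing import Callable, Dict, FrozenSet, List

# Hypothetical shared fact store: a name mapped to an immutable set of facts.
Facts = Dict[str, FrozenSet]
Phase = Callable[[Facts], Facts]


def resolve_indirect_calls(facts: Facts) -> Facts:
    """Turn calls through pointers into direct call edges using points-to facts."""
    extra = {
        (caller, fn)
        for caller, ptr in facts["indirect_calls"]
        for var, fn in facts["points_to"]
        if var == ptr
    }
    return {**facts, "calls": facts["calls"] | frozenset(extra)}


def reachable_functions(facts: Facts) -> Facts:
    """Close the set of reachable functions under the current call edges."""
    reach = set(facts["reachable"])
    changed = True
    while changed:
        new = {callee for caller, callee in facts["calls"] if caller in reach} - reach
        reach |= new
        changed = bool(new)
    return {**facts, "reachable": frozenset(reach)}


def run_until_stable(phases: List[Phase], facts: Facts, max_rounds: int = 10) -> Facts:
    """Client-chosen phase order, repeated until no phase changes any fact."""
    for _ in range(max_rounds):
        before = facts
        for phase in phases:
            facts = phase(facts)
        if facts == before:
            break
    return facts


initial: Facts = {
    "points_to": frozenset({("fp", "f")}),
    "indirect_calls": frozenset({("main", "fp")}),
    "calls": frozenset({("f", "g")}),
    "reachable": frozenset({"main"}),
}
# Reachability runs first here, so a second round is needed to see f and g.
result = run_until_stable([reachable_functions, resolve_indirect_calls], initial)
```

In this toy setup the client decides both the order of the two phases and the stopping rule; a different client could run each phase once, or interleave further phases.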

We  also  anticipated  the  following  work  to  meet  our  secondary  objectives: 

•  Support  to  MURIs.  Discussions  with  the  MURI  researchers  to  elicit  their  requirements,
interactions  during  an  iterative  design  cycle,  delivery  of  new  versions  of  software,  and 
bug  fixing  in  a  timely  manner. 

•  Transition  from  MURIs.  Adaptation  and  adoption  of  MURI  research  results  within 
GrammaTech. 

•  Outreach.  Discussions  with  prospective  early  adopters,  packaging  and  documenting 
results  in  a  distributable  form,  delivery  to  early  adopters,  and  support  to  them.  Writing  of 
co-authored  papers. 

•  Reporting.  Participation  in  semi-annual  MURI  reviews.

E  Results 

Our  primary  technical  results  were  as  follows: 

1.  Multi-lingual  capabilities.

a.  We  developed  a  front-end  System  Development  Kit  (SDK)  for  CodeSurfer  that 
facilitates  creation  of  alternative  front  ends  for  different  programming  languages. 
The  SDK  contains 

i.  Abstract  datatypes  that  can  be  used  by  a  front  end  to  build  and  output  the 
intermediate  representations  needed  by  the  CodeSurfer  builder. 

ii.  The  notion  of  a  Language  Module  that  contains  all  language-specific  code 
and  data  needed  by  CodeSurfer. 

iii.  Complete  documentation. 

The  SDK  supports: 

i.  A  language-independent  abstract-syntax-tree  (AST)  framework.  The  AST
representation  can  be  made  available  for  use  in  front  ends,  in  the 
dependence-graph  builder,  and  in  the  back-end  scripting  language. 

ii.  A  language-independent  control-flow  graph  (CFG)  definition  facility. 

iii.  An  abstract  datatype  for  source-position  information. 

iv.  Abstract  datatypes  in  support  of  pointer  analysis  (see  PAM,  below). 

b.  We  reinstantiated  our  pre-existing  ad  hoc  versions  of  CodeSurfer  for  C/C++  and 
Intel  x86  using  the  SDK. 

c.  We  used  the  SDK  to  implement  a  new  version  of  CodeSurfer  for  Jimple  [1],  a 
three-address  version  of  Java  byte  codes.  (This  work  was  funded  by  our  NIST 
SBIR  contract.) 

2.  Builder  modularization. 

a.  We  factored  CodeSurfer’s  pre-existing  pointer  analysis  code  into  a  separate
Pointer  Analysis  Module  (PAM)  that  can  be  used  independently  of  CodeSurfer. 
PAM  consists  of 
i. An SDK for creating the intermediate representations needed by the
pointer analysis engine.

ii. The pointer analysis engine itself.

iii. An API, termed the Pointer Analysis Data Base (PADB), for accessing the
points-to results that have been computed by the pointer analysis engine.

b. We had originally anticipated modularizing and exposing the individual analysis
components of the CodeSurfer builder, thereby allowing users to instantiate
different “builders” of their own choosing. However, two concerns on the part of
the PIs at Wisconsin, performance and intellectual property rights, led us to
implement a quite different architecture. In short, Wisconsin wanted a lightweight
analysis platform for x86 binaries that could be severed from CodeSurfer
altogether. Accordingly, rather than modularizing the CodeSurfer builder, we
were asked to provide an effective analysis infrastructure wholly within the x86
front end. In effect, the front-end SDK, which was intended just to facilitate a
client’s access to the analysis capabilities of CodeSurfer’s builder, became the
analysis platform itself. Some features of the CodeSurfer builder, e.g., basic
block analysis, were lifted from the builder and replicated in the front end.
Unfortunately, this was at odds with our goal of minimizing code duplication.

3. Back-end extensions. Joint work by GrammaTech and Wisconsin on buffer-overrun
detection led to several back-end extensions:

a. The creation of a general-purpose browser for viewing the results of code scans.

b. The creation of an open API for serializing CodeSurfer objects. This extension
was needed to provide a persistent representation of the buffer-overrun results.
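
The role of the front-end SDK described in item 1 above (a Language Module holding the language-specific code, feeding language-independent CFG and source-position datatypes to the builder) can be pictured with a short sketch. Every name below (`CfgNode`, `LanguageModule`, `StraightLineModule`, `count_edges`) is a hypothetical illustration and does not reproduce the actual SDK interface.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass
class CfgNode:
    """Language-neutral CFG node carrying a source position (hypothetical type)."""
    node_id: int
    label: str
    source_line: int
    successors: List[int] = field(default_factory=list)


class LanguageModule(Protocol):
    """What a front end would supply in this sketch; not the real SDK contract."""
    name: str

    def build_cfg(self, source: str) -> List[CfgNode]: ...


class StraightLineModule:
    """Toy front end: one CFG node per non-blank line, linked in sequence."""
    name = "toy"

    def build_cfg(self, source: str) -> List[CfgNode]:
        nodes: List[CfgNode] = []
        for line_no, text in enumerate(source.splitlines(), start=1):
            if text.strip():
                nodes.append(CfgNode(len(nodes), text.strip(), line_no))
        for prev, nxt in zip(nodes, nodes[1:]):
            prev.successors.append(nxt.node_id)
        return nodes


def count_edges(module: LanguageModule, source: str) -> int:
    """A language-independent client: it only ever sees CfgNode objects."""
    return sum(len(node.successors) for node in module.build_cfg(source))


toy_nodes = StraightLineModule().build_cfg("a = 1\n\nb = a\nreturn b\n")
toy_edges = count_edges(StraightLineModule(), "a = 1\n\nb = a\nreturn b\n")
```

The point of the split is that `count_edges` never inspects source text: adding a new language in this sketch means writing another class with a `build_cfg` method, while the client code stays unchanged.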

Our activities aimed at the secondary objectives were as follows:

• Collaboration with the Wisconsin MURI. In two of the application areas, Object Code
Analysis and Buffer Overrun Analysis, the efforts at GrammaTech and Wisconsin
became so tightly intertwined that it is not appropriate to describe those activities in
simple uni-directional terms as “support to” or “transition from” the MURI. The efforts
became true collaborations.

o Object Code Analysis. Wisconsin and GrammaTech collaborated on the
development of x86fe (a.k.a. “the connector”), which can be used as a standalone
x86 analysis module, or as a CodeSurfer front end. Roughly speaking, the
division of labor is that Reps and Balakrishnan work on Value Set Analysis
(VSA), Affine Relation Analysis (ARA), and Aggregate Structure Identification
(ASI) [4, 5], and GrammaTech does the rest, including:*

i.  Use/Def  Information.  Detailed  use/kill/conditional-kill  information  is 
provided  for  every  instruction. 


* Note that some of the x86fe services listed were implemented in a second, follow-on contract the year after this
contract ended. We provide the complete list in the interest of being clear about the services offered by x86fe today;
we do not believe that it is essential to detail the exact year in which each service was implemented or further
perfected.

ii.  Register  Live-Range  Analysis  (RLRA).  The  live-ranges  of  registers  are
computed. 

iii.  Basic  Blocks.  Basic  blocks  for  the  entire  application  (including  libraries)
are  computed. 

iv.  Call  Graph.  The  call  supergraph  for  the  entire  program  (including 
libraries)  is  computed. 

v.  Support  for  Libraries.  A  repository  of  pre-processed  libraries  is
computed,  from  which  individual  procedures  are  demand  loaded. 

vi.  Spill  Regions.  Regions  of  code  are  computed  in  which  registers  are 
"spilled"  into  memory  locations  (e.g.,  "mem  =  eax;  ...;  eax  =  mem;"). 

vii.  Register  Save/Restore  Instruction  Pairs.  Pairs  of  instructions  that  are  used 
to  save/restore  registers  at  a  call  (caller),  or  on  entry  to/exit  from  a 
procedure  (callee),  are  computed. 

viii.  Port  Analysis.  Pseudo-variables  are  created  for  all  ports  accessed  by  the 
program.  Port  references  are  determined  using  constant  propagation. 

ix.  Support  for  Multiple-Entry-Point  Functions.  Multiple-entry-point 
functions  are  detected  and  represented. 

x.  Support  for  Clones.  Multiple-entry-point  functions  are  optionally  cloned
and  converted  to  single-entry-point  functions. 

xi.  Support  for  Non-linear  Functions.  Instructions  that  are  not  included  in 
any  function(s)  by  IDAPro  are  added  to  the  function(s)  that  end  up 
executing  them. 

xii.  Support  for  Import  Tables.  Functions  and  DLLs  that  are  imported  by  the 
program  are  detected. 

xiii.  Register  Aliases.  A  map  of  register  aliases  is  maintained  (e.g.,  al  ->  ax  -> 
eax). 

xiv.  End-to-end  Connectivity  with  CodeSurfer.  x86fe  and  CodeSurfer  are  kept 
in  synch. 

o  Buffer  Overrun  Analysis.  Wisconsin  and  GrammaTech  collaborated  on  the 
development  of  a  tool  for  the  detection  of  buffer-overrun  vulnerabilities  in  ANSI 
C  programs.  During  the  year,  we  co-authored  a  paper  describing  our  joint 
work  [6].  We  tested  the  tool  on  the  current  version  (2.6.2)  of  the  Washington 
University  FTP  daemon,  a  popular  file  transfer  server,  found  14  previously 
unreported  overruns,  and  reported  them  to  the  developers. 

•  Support  to  the  MURIs. 

o PAM. On 9/26/02, Prof. Jha requested that GrammaTech repackage CodeSurfer’s
pointer analysis module as a standalone component (PAM) for use in a joint
CMU/Wisconsin project involving Prof. Clarke and his student Sagar Chaki. The
user manual was delivered to CMU and Wisconsin on 12/20/02 (in Q3), although
delivery of the completed software did not occur for another month (in Q4).
Creating  PAM  involved  interacting  closely  with  Mr.  Chaki  during  the 
requirements,  design,  implementation,  and  deployment  phases  of  the  effort. 

•  Transition  from  MURIs. 

o  Model  Checking.  We  transitioned  the  work  of  Reps,  Schwoon,  and  Jha  on 
weighted  pushdown  systems  [8],  and  their  prototype  implementation  (Weighted 
Moped)  to  GrammaTech,  where  we  used  it  as  a  model  checking  engine  for  the 
Path  Inspector  [7],  a  tool  that  checks  sequencing  properties  in  programs.  The 
Path  Inspector  has  been  released  as  a  commercial  product. 

•  Outreach.  In  a  separately  funded  effort,  we  worked  to  transition  Wisconsin’s  research  to 
SSC-SD  (SPAWAR).  As  a  first  step,  we  trained  a  SPAWAR  employee  to  use 
CodeSurfer  and  the  prototype  Wisconsin/GrammaTech  buffer-overrun  vulnerability 
detector.  We  then  used  the  buffer-overrun  tool  to  analyze  the  GCCS-M  Tactical 
Management  Service  (TMS),  and  found  one  possible  overrun  in  this  fielded  program, 
albeit  not  an  overrun  that  can  be  exploited  to  seize  control  of  the  program. 

• Reporting. We attended the semi-annual MURI review in Pittsburgh, and the
semi-annual MURI review in Williamsburg. Tim Teitelbaum reported on GrammaTech’s
activities at both reviews.
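
Two of the x86fe services listed under Object Code Analysis above, the register-alias map (al -> ax -> eax) and register live-range analysis driven by per-instruction use/kill information, can be illustrated with a small sketch. This is not the x86fe implementation: the function names and data layout are hypothetical, only straight-line code is handled, and a write to a sub-register is coarsely treated as killing the whole register, which a real analysis (with conditional kills) would refine.

```python
from typing import FrozenSet, Iterable, List, Tuple

# Hypothetical alias map in the style "al -> ax -> eax".
ALIASES = {"al": "ax", "ah": "ax", "ax": "eax", "bl": "bx", "bx": "ebx"}


def canonical(reg: str) -> str:
    """Follow the alias chain to the widest enclosing register."""
    while reg in ALIASES:
        reg = ALIASES[reg]
    return reg


Instruction = Tuple[Iterable[str], Iterable[str]]  # (registers used, registers killed)


def live_before(instructions: List[Instruction],
                live_out: Iterable[str] = ()) -> List[FrozenSet[str]]:
    """Backward liveness over straight-line code.

    Returns, for each instruction, the set of registers live just before it.
    """
    live = {canonical(r) for r in live_out}
    result: List[FrozenSet[str]] = []
    for uses, kills in reversed(instructions):
        live = (live - {canonical(r) for r in kills}) | {canonical(r) for r in uses}
        result.append(frozenset(live))
    result.reverse()
    return result


# A spill pattern like "mem = eax; ...; eax = mem", followed by a use of al.
program: List[Instruction] = [
    (["eax"], []),     # mov [mem], eax   (spill: uses eax)
    ([], ["eax"]),     # mov eax, 5       (eax reused for something else)
    ([], ["eax"]),     # mov eax, [mem]   (reload: redefines eax)
    (["al"], ["ebx"]), # movzx ebx, al    (uses eax via its alias al)
]
live_sets = live_before(program, live_out=["ebx"])
```

In this example eax is live before the spill and again between the reload and its use, but not in between, which is the gap that a spill-region computation would have to bridge through the memory location.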

F  Conclusions 

The project confirmed the hypothesis that it is technically possible to build an effective common
infrastructure for the static analysis of software that meets the needs of a disparate collection of
applications.